Tuesday June 12, 2018

Today's focus

What data to use in introductory statistics and data science courses?

Ideally data that's:

  1. Rich enough to answer meaningful questions with
  2. Real enough to ensure that there is context
  3. Realistic enough to convey the reality of much of the world's data

One goal

On the one hand, Cobb (2015) argues that we should

  1. "Teach through research"
  2. "Minimize prerequisites to research"

Another goal

Analogy for second goal

Two conflicting goals

  • On the one hand: Minimize prerequisites to research
  • On the other: Do not betray reality of data as it exists in much of the world

Back to analogy

In other words, a balancing act is required between:

  • Data with no prerequisites needed
  • Data as it exists "in the wild"

Data "taming"

Data "taming" sets out to balance:

  • On the one hand: Performing enough pre-processing so that data is accessible to R novices
  • On the other: Not performing so much pre-processing as to betray the reality of data as it exists "in the wild"

"Tame" data principles

We propose the following "tame" data principles to remove the biggest hurdles R novices face:

  1. Clean variable names
  2. Identification variables in left-hand columns
  3. Clean dates
  4. Logically ordered categorical variables
  5. Consistent "tidy" format

fivethirtyeight package

In the fivethirtyeight R package, Chester Ismay, Jennifer Chunn, and I:

  • Take FiveThirtyEight's raw article data from GitHub
  • Pre-process the raw data so that it follows "tame" data principles
  • Make the tame data, documentation, and original article easily accessible via an R package

Examples

The following examples involve code, so I suggest you follow along in the HTML version of the slides:

  1. In your browser, go to bit.ly/causeweb_tame
  2. In the left-hand menu, click on "Principle 1: Clean variable names"

Principle 1: Clean variable names

a) Comparing raw and tamed data

library(readr)
library(fivethirtyeight)

# Raw data: variable names are unwieldy & have spaces
flying_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv")
colnames(flying_raw)[c(5, 19)]
## [1] "Do you have any children under 18?"               
## [2] "In general, is itrude to bring a baby on a plane?"
# Tamed data: corresponding variable names are cleaner
colnames(flying)[c(5, 18)]
## [1] "children_under_18" "baby"

b) Why should we care?

Working with variable names that are long, unwieldy, and contain spaces is tricky: each name has to be wrapped in backticks and typed out exactly.

mosaicplot(~ `Do you have any children under 18?` + `In general, is itrude to bring a baby on a plane?`, 
           data = flying_raw,  main = "Raw data",
           xlab = "Have a baby?", ylab = "Is it rude?")
mosaicplot(~ children_under_18 + baby,
           data = flying,  main = "Tamed data",
           xlab = "Have a baby?", ylab = "Is it rude?")
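
One way such names might be cleaned automatically is sketched below in base R. The helper to_snake_case() is hypothetical and is not how the fivethirtyeight package names its variables; the package's names, such as baby, are much shorter than anything this approach produces.

```r
# Hypothetical helper (not part of fivethirtyeight): convert arbitrary
# column names to lower-case snake_case using base R only.
to_snake_case <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # runs of non-alphanumerics become "_"
  gsub("^_+|_+$", "", x)           # drop leading/trailing underscores
}

# The two raw column names shown above, copied verbatim
raw_names <- c("Do you have any children under 18?",
               "In general, is itrude to bring a baby on a plane?")
clean_names <- to_snake_case(raw_names)
clean_names
```

The resulting names no longer need backticks, but they are still long, which is one argument for choosing short names by hand.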

Principle 2: ID variables

This principle is more organizational. Any identification variables that uniquely identify the observations/rows should be placed in the left-hand columns, since they are of highest prominence. Such variables serve as keys when joining/merging datasets.

library(fivethirtyeight)

# Both title and the IMDb site tag uniquely identify movies. Show only the first
# 8 columns and first 3 rows of the dataset:
biopics[1:3, 1:8]
## # A tibble: 3 x 8
##   title   site   country year_release box_office director number_of_subje…
##   <chr>   <chr>  <chr>          <int>      <dbl> <chr>               <int>
## 1 10 Ril… tt006… UK              1971        NA  Richard…                1
## 2 12 Yea… tt202… US/UK           2013  56700000. Steve M…                1
## 3 127 Ho… tt154… US/UK           2010  18300000. Danny B…                1
## # ... with 1 more variable: subject <chr>
# The episode variable uniquely identifies episodes of "The Joy of Painting". Show
# only the first 8 columns and 3 randomly chosen rows of the dataset using the dplyr package
library(dplyr)
bob_ross %>% 
  select(1:8) %>% 
  sample_n(3)
## # A tibble: 3 x 8
##   episode season episode_num title apple_frame aurora_borealis  barn beach
##   <chr>    <dbl>       <dbl> <chr>       <int>           <int> <int> <int>
## 1 S20E05     20.          5. DIVI…           0               0     0     0
## 2 S23E11     23.         11. FROZ…           0               0     0     0
## 3 S14E04     14.          4. SNOW…           0               0     0     0
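
To illustrate why identification variables matter for joins, here is a minimal base R sketch. The two data frames are made up for illustration (they are not from the package); only the idea of a shared key column named site is borrowed from the biopics example above.

```r
# Toy data (made up for illustration): two tables sharing the key "site"
movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  year  = c(2001, 2005, 2010),
  site  = c("tt001", "tt002", "tt003"),
  stringsAsFactors = FALSE
)
ratings <- data.frame(
  site   = c("tt003", "tt001"),
  rating = c(7.9, 6.4),
  stringsAsFactors = FALSE
)

# Move the identification variable to the left-most column
movies <- movies[, c("site", setdiff(names(movies), "site"))]

# Left join on the key: keep every movie, attach a rating where one exists
movies_rated <- merge(movies, ratings, by = "site", all.x = TRUE)
movies_rated
```

Movies without a match in ratings keep their row and get NA for rating.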

Principle 3: Dates

a) Comparing raw and tamed data

library(readr)
library(dplyr)
library(fivethirtyeight)

# Raw data: year, month, day are separate variables
US_births_1994_2003_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv")
head(US_births_1994_2003_raw)
## # A tibble: 6 x 5
##    year month date_of_month day_of_week births
##   <int> <int>         <int>       <int>  <int>
## 1  1994     1             1           6   8096
## 2  1994     1             2           7   7772
## 3  1994     1             3           1  10142
## 4  1994     1             4           2  11248
## 5  1994     1             5           3  11053
## 6  1994     1             6           4  11406
# Tamed data: variable date of type "date" included
head(US_births_1994_2003)
## # A tibble: 6 x 6
##    year month date_of_month date       day_of_week births
##   <int> <int>         <int> <date>     <ord>        <int>
## 1  1994     1             1 1994-01-01 Sat           8096
## 2  1994     1             2 1994-01-02 Sun           7772
## 3  1994     1             3 1994-01-03 Mon          10142
## 4  1994     1             4 1994-01-04 Tues         11248
## 5  1994     1             5 1994-01-05 Wed          11053
## 6  1994     1             6 1994-01-06 Thurs        11406

b) Why should we care?

Without a variable of type date, making time series plots is difficult.

# Use filter command from dplyr package for data wrangling
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)

# Plot time series via base R:
plot(x = US_births_1999$date, y = US_births_1999$births, type = "l", 
     xlab = "Date", ylab = "Number of births", main = "1999 US Births")
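
For comparison, this is roughly what a student would have to do with the raw data before plotting. It is a sketch in base R on the first three raw rows shown earlier, not necessarily the code used to build the package.

```r
# First three rows of the raw births data shown above
births_raw <- data.frame(
  year          = c(1994L, 1994L, 1994L),
  month         = c(1L, 1L, 1L),
  date_of_month = c(1L, 2L, 3L),
  births        = c(8096L, 7772L, 10142L)
)

# Combine the three separate columns into one variable of class Date
births_raw$date <- as.Date(
  paste(births_raw$year, births_raw$month, births_raw$date_of_month, sep = "-"),
  format = "%Y-%m-%d"
)

# ISO weekday number (Monday = 1, ..., Sunday = 7) derived from the date
births_raw$weekday_num <- as.integer(format(births_raw$date, "%u"))
births_raw
```

The derived weekday numbers agree with the raw day_of_week column for these rows (6, 7, 1).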

Principle 4: Categorical variables

a) Comparing raw and tamed data

library(readr)
library(ggplot2)
library(fivethirtyeight)
bechdel_raw <- read_csv("https://raw.githubusercontent.com/rudeboybert/fivethirtyeight/master/data-raw/bechdel/movies.csv")

# Raw data: categorical variable clean_test is saved as characters/strings
bechdel_raw$clean_test[1:5]
## [1] "notalk" "ok"     "notalk" "notalk" "men"
# Tamed data: clean_test is saved as factor
bechdel$clean_test[1:5]
## [1] notalk ok     notalk notalk men   
## Levels: nowomen < notalk < men < dubious < ok

b) Why should we care?

By default, R plots character values in alphabetical order, whereas with factors we can set the order of the levels. Reordering a categorical variable/factor in R is tough, especially for new R users. In this case, the tamed data lets us order the bars along the hierarchical nature of the Bechdel test:

# Using raw data:
ggplot(bechdel_raw, aes(x = clean_test)) +
  geom_bar() +
  labs(x = "Bechdel test outcome", y = "count", title = "Raw data")

# Using tamed data:
ggplot(bechdel, aes(x = clean_test)) +
  geom_bar() +
  labs(x = "Bechdel test outcome", y = "count", title = "Tamed data")
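
A minimal sketch of the conversion in base R, using the five raw values and the level order shown in the output above. This is an illustration, not necessarily the exact code used in the package.

```r
# The five raw character values shown above
clean_test_raw <- c("notalk", "ok", "notalk", "notalk", "men")

# Level order copied from the tamed output above
bechdel_levels <- c("nowomen", "notalk", "men", "dubious", "ok")

# Convert to an ordered factor with explicitly stated levels
clean_test <- factor(clean_test_raw, levels = bechdel_levels, ordered = TRUE)

# Character values sort alphabetically; factor levels keep the stated order
sort(unique(clean_test_raw))
levels(clean_test)

# Counts follow the level order and include levels with zero observations
table(clean_test)
```

Plotting functions such as geom_bar() follow the level order of a factor, which is why the tamed plot is ordered sensibly.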

Principle 5: "Tidy" data format

"Tidy" data format is narrow/long format, as opposed to wide. This format is chosen for input/output data frame standardization across many R packages in the tidyverse: ggplot2, dplyr, etc. There are three interrelated rules which make a dataset "tidy":

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

a) Comparing raw and tamed data

library(dplyr)
library(ggplot2)
library(fivethirtyeight)

# In the fivethirtyeight package, the drinks data is kept in its original non-tidy (wide) format
head(drinks)
## # A tibble: 6 x 5
##   country   beer_servings spirit_servings wine_servings total_litres_of_p…
##   <chr>             <int>           <int>         <int>              <dbl>
## 1 Afghanis…             0               0             0              0.   
## 2 Albania              89             132            54              4.90 
## 3 Algeria              25               0            14              0.700
## 4 Andorra             245             138           312             12.4  
## 5 Angola              217              57            45              5.90 
## 6 Antigua …           102             128            45              4.90
# The tidyr::gather() code to convert to tidy format is given in the help file: ?drinks
library(tidyr)
drinks_tidy <- drinks %>%
  gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>% 
  arrange(country)
head(drinks_tidy)
## # A tibble: 6 x 4
##   country     total_litres_of_pure_alcohol type            servings
##   <chr>                              <dbl> <chr>              <int>
## 1 Afghanistan                         0.   beer_servings          0
## 2 Afghanistan                         0.   spirit_servings        0
## 3 Afghanistan                         0.   wine_servings          0
## 4 Albania                             4.90 beer_servings         89
## 5 Albania                             4.90 spirit_servings      132
## 6 Albania                             4.90 wine_servings         54
ggplot(drinks_tidy, aes(x = type, y = servings)) + 
  geom_boxplot() +
  labs(x = "Alcohol type", y = "Number of servings", title = "Worldwide alcohol consumption")
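
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below assumes such a version is installed and uses two rows copied from the wide output above, with the total_litres_of_pure_alcohol column left out for brevity.

```r
library(tidyr)

# Rows for Albania and Algeria copied from the wide output above
drinks_wide <- data.frame(
  country         = c("Albania", "Algeria"),
  beer_servings   = c(89L, 25L),
  spirit_servings = c(132L, 0L),
  wine_servings   = c(54L, 14L),
  stringsAsFactors = FALSE
)

# Wide to long: every column except country becomes a (type, servings) pair
drinks_long <- pivot_longer(
  drinks_wide,
  cols      = -country,
  names_to  = "type",
  values_to = "servings"
)
drinks_long
```

Each country now has one row per alcohol type, which is the shape ggplot2 expects for the boxplot above.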

Advanced example

a) Comparing raw and tamed data

In the tamed pres_2016_trail data frame we:

  1. Ensured lat and lng were in numerical format, not in degree/minute/second, North/South, and East/West format (a variation on Principle 3: Dates)
  2. Combined both CSVs into one and added the variable candidate (Principle 5: Tidy data format)

library(dplyr)
library(fivethirtyeight)

# Tamed data: 
pres_2016_trail %>% 
  arrange(date) %>% 
  head()
## # A tibble: 6 x 5
##   candidate date       location             lat   lng
##   <chr>     <date>     <chr>              <dbl> <dbl>
## 1 Trump     2016-09-01 Wilmington, OH      39.4 -83.8
## 2 Trump     2016-09-03 Detroit, MI         42.3 -83.0
## 3 Clinton   2016-09-05 Cleveland, Ohio     41.5 -81.7
## 4 Clinton   2016-09-05 Hampton, Illinois   41.6 -90.4
## 5 Clinton   2016-09-06 Tampa, Florida      28.0 -82.5
## 6 Trump     2016-09-06 Virginia Beach, VA  36.9 -76.0
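
The coordinate conversion in step 1 can be sketched as follows. The helper dms_to_decimal() is hypothetical, the example coordinates are made up, and the exact layout of the raw campaign data is an assumption; only the arithmetic is standard.

```r
# Hypothetical helper (not from the fivethirtyeight package): convert
# degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees.
dms_to_decimal <- function(degrees, minutes, seconds, hemisphere) {
  decimal <- degrees + minutes / 60 + seconds / 3600
  # South latitudes and West longitudes are negative by convention
  ifelse(hemisphere %in% c("S", "W"), -decimal, decimal)
}

# Made-up coordinates: 39 deg 30' 0" N, 83 deg 45' 0" W
lat <- dms_to_decimal(39, 30, 0, "N")
lng <- dms_to_decimal(83, 45, 0, "W")
c(lat = lat, lng = lng)
```

With numeric lat and lng, the points can be mapped directly to the x and y aesthetics of a plot.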

b) Why should we care?

So we can easily create a faceted map!

library(ggplot2)
library(maps)
ggplot(data = pres_2016_trail, aes(x = lng, y = lat)) +
  facet_wrap(~candidate) +
  geom_point(col = "black", size = 2) + 
  coord_map() + 
  # Override data & aes()thetic mapping set above to trace path of state outlines:
  geom_path(data = map_data("state"), aes(x = long, y = lat, group = group), size = 0.1)

Comments

  • Analogy I heard that I like: fivethirtyeight is like a data petting zoo
  • No "universal" balance of two goals: it will vary depending on your students' experience, requirements, and needs
  • Tame data principles and fivethirtyeight can be used in other contexts: 1) intermediate-level data science courses and 2) advanced projects

Used in data science courses

Used for advanced projects

Other resources